In [ ]:
%%HTML
<script src="require.js"></script>

Project Overview:¶

This project involves a comprehensive analysis of loan data, specifically focusing on data from ID/X Partners. The dataset contains various features related to loans, including loan amounts, interest rates, borrower information, and loan status. The goal is to gain insights into the factors influencing loan performance, visualize trends over time, and make informed suggestions for optimizing business strategies.

Import necessary libraries¶

In [ ]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default='notebook'

Data preprocessing¶

In [ ]:
# Load the dataframe
loan_data = pd.read_csv(r'/workspaces/idx_finalproject/resources/loan_data_2007_2014.csv', low_memory=False)
In [ ]:
# Print dataframe head row
loan_data.head()
Out[ ]:
Unnamed: 0 id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 0 1077501 1296599 5000 5000 4975.0 36 months 10.65 162.87 B ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1077430 1314167 2500 2500 2500.0 60 months 15.27 59.83 C ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 1077175 1313524 2400 2400 2400.0 36 months 15.96 84.33 C ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 1076863 1277178 10000 10000 10000.0 36 months 13.49 339.31 C ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 1075358 1311748 3000 3000 3000.0 60 months 12.69 67.79 B ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 75 columns

In [ ]:
# Display dataframe info
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 466285 entries, 0 to 466284
Data columns (total 75 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Unnamed: 0                   466285 non-null  int64  
 1   id                           466285 non-null  int64  
 2   member_id                    466285 non-null  int64  
 3   loan_amnt                    466285 non-null  int64  
 4   funded_amnt                  466285 non-null  int64  
 5   funded_amnt_inv              466285 non-null  float64
 6   term                         466285 non-null  object 
 7   int_rate                     466285 non-null  float64
 8   installment                  466285 non-null  float64
 9   grade                        466285 non-null  object 
 10  sub_grade                    466285 non-null  object 
 11  emp_title                    438697 non-null  object 
 12  emp_length                   445277 non-null  object 
 13  home_ownership               466285 non-null  object 
 14  annual_inc                   466281 non-null  float64
 15  verification_status          466285 non-null  object 
 16  issue_d                      466285 non-null  object 
 17  loan_status                  466285 non-null  object 
 18  pymnt_plan                   466285 non-null  object 
 19  url                          466285 non-null  object 
 20  desc                         125981 non-null  object 
 21  purpose                      466285 non-null  object 
 22  title                        466264 non-null  object 
 23  zip_code                     466285 non-null  object 
 24  addr_state                   466285 non-null  object 
 25  dti                          466285 non-null  float64
 26  delinq_2yrs                  466256 non-null  float64
 27  earliest_cr_line             466256 non-null  object 
 28  inq_last_6mths               466256 non-null  float64
 29  mths_since_last_delinq       215934 non-null  float64
 30  mths_since_last_record       62638 non-null   float64
 31  open_acc                     466256 non-null  float64
 32  pub_rec                      466256 non-null  float64
 33  revol_bal                    466285 non-null  int64  
 34  revol_util                   465945 non-null  float64
 35  total_acc                    466256 non-null  float64
 36  initial_list_status          466285 non-null  object 
 37  out_prncp                    466285 non-null  float64
 38  out_prncp_inv                466285 non-null  float64
 39  total_pymnt                  466285 non-null  float64
 40  total_pymnt_inv              466285 non-null  float64
 41  total_rec_prncp              466285 non-null  float64
 42  total_rec_int                466285 non-null  float64
 43  total_rec_late_fee           466285 non-null  float64
 44  recoveries                   466285 non-null  float64
 45  collection_recovery_fee      466285 non-null  float64
 46  last_pymnt_d                 465909 non-null  object 
 47  last_pymnt_amnt              466285 non-null  float64
 48  next_pymnt_d                 239071 non-null  object 
 49  last_credit_pull_d           466243 non-null  object 
 50  collections_12_mths_ex_med   466140 non-null  float64
 51  mths_since_last_major_derog  98974 non-null   float64
 52  policy_code                  466285 non-null  int64  
 53  application_type             466285 non-null  object 
 54  annual_inc_joint             0 non-null       float64
 55  dti_joint                    0 non-null       float64
 56  verification_status_joint    0 non-null       float64
 57  acc_now_delinq               466256 non-null  float64
 58  tot_coll_amt                 396009 non-null  float64
 59  tot_cur_bal                  396009 non-null  float64
 60  open_acc_6m                  0 non-null       float64
 61  open_il_6m                   0 non-null       float64
 62  open_il_12m                  0 non-null       float64
 63  open_il_24m                  0 non-null       float64
 64  mths_since_rcnt_il           0 non-null       float64
 65  total_bal_il                 0 non-null       float64
 66  il_util                      0 non-null       float64
 67  open_rv_12m                  0 non-null       float64
 68  open_rv_24m                  0 non-null       float64
 69  max_bal_bc                   0 non-null       float64
 70  all_util                     0 non-null       float64
 71  total_rev_hi_lim             396009 non-null  float64
 72  inq_fi                       0 non-null       float64
 73  total_cu_tl                  0 non-null       float64
 74  inq_last_12m                 0 non-null       float64
dtypes: float64(46), int64(7), object(22)
memory usage: 266.8+ MB
In [ ]:
# Drop irrelevant columns
loan_data.drop(['Unnamed: 0', 'url'], axis=1, inplace=True)
In [ ]:
# Convert date-value columns to datetime with explicit date format
loan_data['issue_d'] = pd.to_datetime(loan_data['issue_d'], format='%b-%y')
loan_data['next_pymnt_d'] = pd.to_datetime(loan_data['next_pymnt_d'], format='%b-%y')
loan_data['last_pymnt_d'] = pd.to_datetime(loan_data['last_pymnt_d'], format='%b-%y')
loan_data['last_credit_pull_d'] = pd.to_datetime(loan_data['last_credit_pull_d'], format='%b-%y')

# Print datetime columns
print(loan_data[['issue_d', 'next_pymnt_d', 'last_pymnt_d', 'last_credit_pull_d']])
          issue_d next_pymnt_d last_pymnt_d last_credit_pull_d
0      2011-12-01          NaT   2015-01-01         2016-01-01
1      2011-12-01          NaT   2013-04-01         2013-09-01
2      2011-12-01          NaT   2014-06-01         2016-01-01
3      2011-12-01          NaT   2015-01-01         2015-01-01
4      2011-12-01   2016-02-01   2016-01-01         2016-01-01
...           ...          ...          ...                ...
466280 2014-01-01   2016-02-01   2016-01-01         2016-01-01
466281 2014-01-01          NaT   2014-12-01         2016-01-01
466282 2014-01-01   2016-02-01   2016-01-01         2015-12-01
466283 2014-01-01          NaT   2014-12-01         2015-04-01
466284 2014-01-01   2016-02-01   2016-01-01         2016-01-01

[466285 rows x 4 columns]
In [ ]:
# Handling missing data

# Identify numerical and categorical columns
numerical_columns = loan_data.select_dtypes(include=['float64', 'int64']).columns
categorical_columns = loan_data.select_dtypes(include=['object']).columns

# Fill missing values in numerical columns with the mean
loan_data[numerical_columns] = loan_data[numerical_columns].fillna(loan_data[numerical_columns].mean())

# Fill missing values in categorical columns with "Not specified"
loan_data[categorical_columns] = loan_data[categorical_columns].fillna("Not specified")

Initial findings¶

In [ ]:
# Initial findings

# Select columns to find initial insights
use_cols = ['loan_amnt', 'funded_amnt', 'int_rate', 'grade', 
            'annual_inc', 'dti', 'open_acc', 'pub_rec', 
            'revol_bal', 'total_acc', 'loan_status', 'int_rate',
            'issue_d', 'next_pymnt_d', 'last_pymnt_d']

df = loan_data[use_cols]
In [ ]:
# Visualize loan grade distribution
plt.figure(figsize=(10, 6))
grades_order = sorted(df['grade'].unique())
grade_counts = df['grade'].value_counts()[grades_order]
plt.bar(grades_order, grade_counts)
plt.title('Distribution of Loan Grades')
plt.xlabel('Loan Grade')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Visualize the distribution of issue date based on loan status
plt.figure(figsize=(10, 6))
loan_status_values = df['loan_status'].unique()
for status in loan_status_values:
    plt.hist(df[df['loan_status'] == status]['issue_d'], bins=30, alpha=0.5, label=status)
plt.title('Distribution of Loan Issue Dates by Loan Status')
plt.xlabel('Issue Date')
plt.ylabel('Number of Loans') 
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Visualize the distribution of loan amounts based on loan status
plt.figure(figsize=(10, 6))
loan_status_values = df['loan_status'].unique()
for status in loan_status_values:
    plt.hist(df[df['loan_status'] == status]['loan_amnt'], bins=30, alpha=0.5, label=status)
plt.title('Distribution of Loan Amounts by Loan Status')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Visualize the distribution of interest rates based on loan status
plt.figure(figsize=(10, 6))
for status in loan_status_values:
    plt.hist(df[df['loan_status'] == status]['int_rate'], bins=20, alpha=0.5, label=status)
plt.title('Distribution of Interest Rates by Loan Status')
plt.xlabel('Interest Rate')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Explore the relationship between annual income and debt-to-income ratio
plt.figure(figsize=(10, 6))
for status in loan_status_values:
    plt.scatter(df[df['loan_status'] == status]['annual_inc'], 
                df[df['loan_status'] == status]['dti'], 
                label=status, alpha=0.7)
plt.title('Relationship between Annual Income, DTI, and Loan Status')
plt.xlabel('Annual Income')
plt.ylabel('Debt-to-Income Ratio')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Feature engineering¶

In [ ]:
# Feature engineering

# Loan amount to income ratio
loan_data['loan_inc_ratio'] = loan_data['loan_amnt'] / loan_data['annual_inc']

# Distribution plot 
plt.figure(figsize=(10, 6))
loan_data['loan_inc_ratio'].hist(bins=20)
plt.xlabel('Loan to Income Ratio')
plt.title('Distribution of Loan to Income Ratio')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Extract year and month from date columns
loan_data['issue_year'] = loan_data['issue_d'].dt.year
loan_data['issue_month'] = loan_data['issue_d'].dt.month
loan_data['next_pymnt_year'] = loan_data['next_pymnt_d'].dt.year
loan_data['next_pymnt_month'] = loan_data['next_pymnt_d'].dt.month
loan_data['last_pymnt_year'] = loan_data['last_pymnt_d'].dt.year
loan_data['last_pymnt_month'] = loan_data['last_pymnt_d'].dt.month
loan_data['last_credit_pull_year'] = loan_data['last_credit_pull_d'].dt.year
loan_data['last_credit_pull_month'] = loan_data['last_credit_pull_d'].dt.month

# Extracting loan term from 'term' column
loan_data['loan_term'] = loan_data['term'].apply(lambda x: int(x.split()[0]))

# Calculate the average interest rate
avg_interest_rate = loan_data['int_rate'].mean()

# Create a new feature for interest rate difference
loan_data['int_rate_diff'] = loan_data['int_rate'] - avg_interest_rate

# Calculate debt-to-income ratio
loan_data['dti_ratio'] = loan_data['dti'] / loan_data['annual_inc']

# Calculate credit utilization ratio
loan_data['credit_util_ratio'] = loan_data['revol_bal'] / loan_data['total_rev_hi_lim']

# Create a binary indicator for loan status (1 for 'Charged Off', 0 for others)
loan_data['loan_status_binary'] = (loan_data['loan_status'] == 'Charged Off').astype(int)
In [ ]:
# Analyze loan issuance trends over time

# Group data by issue year and month
issue_date_counts = loan_data.groupby(['issue_year', 'issue_month']).size().unstack()

# Calculate default rates by issue year and month
default_rates_by_date = loan_data.groupby(['issue_year', 'issue_month'])['loan_status_binary'].mean().unstack()

# Plot loan issuance trends and default rates side by side horizontally
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Plot loan issuance trends
sns.heatmap(issue_date_counts, cmap='Blues', annot=True, fmt='g', linewidths=.5, ax=axes[0])
axes[0].set_title('Loan Issuance Trends Over Time')
axes[0].set_xlabel('Issue Month')
axes[0].set_ylabel('Issue Year')

# Plot default rates over time
sns.heatmap(default_rates_by_date, cmap='Reds', annot=True, fmt='.2%', ax=axes[1])
axes[1].set_title('Default Rates Over Time')
axes[1].set_xlabel('Issue Month')
axes[1].set_ylabel('Issue Year')

plt.tight_layout()
plt.show()
No description has been provided for this image

The most noticeable insight is that in 2014, there was a significant increase in loan issuances, particularly in October, where over 38,000 loans were issued. This suggests a surge in lending activity during that specific month and year. possibly indicating a strategic move by the lending institution to boost lending during that period. Interestingly, the heatmap shows that the default rates were highest in the years 2011 and 2012, suggesting that loans issued during these periods were more likely to default. The variation in default rates across different years highlights potential patterns or trends that may influence credit risk and loan performance.

In [ ]:
# Get the distribution of loan terms
term_counts = loan_data['loan_term'].value_counts()

# Plot the distribution of loan terms
plt.figure(figsize=(12, 5))

# Plot loan terms
plt.subplot(1, 2, 1)
plt.bar(term_counts.index.astype(str), term_counts, color='skyblue')
plt.title('Distribution of Loan Terms')
plt.xlabel('Loan Term (Months)')
plt.ylabel('Number of Loans')

# Plot a pie chart for loan terms
plt.subplot(1, 2, 2)
plt.pie(term_counts, labels=term_counts.index.astype(str), autopct='%1.1f%%', colors=['lightcoral', 'lightskyblue'])
plt.title('Loan Term Distribution')
plt.axis('equal') 

plt.tight_layout()
plt.show()
No description has been provided for this image

The bar chart and pie chart collectively reveal that the majority of loans in the dataset have a 36-month term, constituting a substantial portion of the loan portfolio. Conversely, the 60-month term, while still popular, represents a smaller proportion of the total loans.

In [ ]:
# Compare individual loan interest rates to the average interest rate

# Plot the distribution of individual loan interest rate differences
plt.figure(figsize=(10, 6))
plt.hist(loan_data['int_rate_diff'], bins=30, alpha=0.5, color='skyblue', label='Interest Rate Differences')
plt.axvline(0, color='red', linestyle='dashed', linewidth=2, label='Average Interest Rate')
plt.title('Distribution of Interest Rate Differences from Average')
plt.xlabel('Interest Rate Difference')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Correlation analysis¶

In [ ]:
# Select relevant columns for correlation analysis
selected_columns_with_engineered = ['loan_amnt', 'funded_amnt', 'int_rate', 'annual_inc',
                                    'dti', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc',
                                    'loan_status_binary', 'issue_year', 'issue_month',
                                    'loan_term', 'int_rate_diff', 'dti_ratio', 'credit_util_ratio']

# Calculate and visualize correlation matrix
correlation_matrix_with_engineered = loan_data[selected_columns_with_engineered].corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix_with_engineered, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix with Engineered Features')
plt.tight_layout()
plt.show()
No description has been provided for this image
  • Strong positive correlations:
    • loan_amnt and funded_amnt (0.99): This is to be expected, as the funded amount is typically a portion of the loan amount requested.
    • annual_inc and loan_amnt (0.37): Borrowers with higher incomes tend to take out larger loans.
    • total_acc and open_acc (0.67): Borrowers with more open accounts also tend to have more total accounts.
    • loan_term and int_rate_diff (0.44): Borrowers with longer loan terms tend to have higher interest rate differences compared to the average.
  • Strong negative correlations:
    • dti and annual_inc (-0.67): Borrowers with higher debt-to-income ratios tend to have lower incomes.
    • dti and credit_util_ratio (-0.40): Borrowers with higher debt-to-income ratios also tend to have lower credit utilization ratios.
    • int_rate and dti_ratio (-0.29): Borrowers with higher debt-to-income ratios tend to have lower interest rates.
  • Weak correlations:
    • loan_status_binary with most other features: This suggests that loan status is not strongly related to most of the other features in the dataset.
    • issue_month with most other features: This suggests that the month in which the loan was issued is not strongly related to most of the other features in the dataset.
In [ ]:
# Group by 'addr_state' and count the number of loans
loan_states = loan_data.groupby('addr_state')['id'].count().reset_index(name='loans')

# Plot the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
top_10 = loan_states.nlargest(10, 'loans')
top_10.plot.bar(x='addr_state', y='loans', rot=0, ax=ax)
plt.title("Top 10 States by Number of Loans")
plt.xlabel("State")
plt.ylabel("Number of Loans")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Plot the choropleth map
fig = px.choropleth(loan_states, 
                    locations='addr_state', 
                    locationmode='USA-states', 
                    color='loans',
                    scope='usa',
                    title='Number of Loans by State',
                    color_continuous_scale='blues')

# Center and adjust layout:
fig.update_layout(
    margin=dict(l=0, r=0, t=50, b=0),
    geo=dict(
        showframe=False,
        projection_scale=1
    )
)

fig.show()

The map reveals that California (CA) has the highest number of loans, with over 70,000 loans. New York (NY) follows with around 40,000 loans, and Texas (TX) is close behind with approximately 36,000 loans.

  • Focus on 36-Month Terms:

    The majority of loans in the dataset have a 36-month term. Consider continuing to offer and promote these shorter-term loans, as they are more popular among borrowers.

  • Strategic Issuance in High-Performing Months:

    Analyze the factors contributing to the surge in loan issuances in specific months, such as the notable increase in October 2014. Consider replicating successful strategies during those high-performing months.

  • Manage Default Rates:

    Address the higher default rates observed in 2011 and 2012. Evaluate the underwriting criteria during those periods and identify any patterns or risk factors contributing to increased default rates. Implement strategies to mitigate default risks.

  • Diversify Marketing Strategies:

    Explore and diversify marketing strategies to attract a broader customer base. Understand the preferences and characteristics of borrowers with successful repayment histories and target similar demographics.

  • Optimize Interest Rates:

    Analyze the distribution of interest rates and their impact on loan performance. Consider optimizing interest rates to remain competitive while ensuring profitability and managing default risks.

  • Enhance Geographic Presence:

    Assess the geographic distribution of loans and identify regions with high demand. Consider expanding or refining the company's presence in those areas to tap into potential markets.

  • Customer Education and Transparency:

    Enhance customer education regarding loan terms, interest rates, and repayment processes. Transparency in communication can build trust and improve customer satisfaction.

  • Adapt to Economic Conditions:

    Stay vigilant to economic conditions that may impact borrower repayment capabilities. Develop adaptive strategies to navigate economic fluctuations and minimize the impact on loan performance.

  • Continuous Monitoring and Adaptation:

    Establish a robust monitoring system to track key performance indicators over time. Regularly revisit and adapt business strategies based on evolving market conditions and customer behaviors.

  • Optimize Current Loan Portfolio:

    Since a significant portion of loans is labeled as "Current," focus on optimizing the management and servicing of these loans. Implement strategies to minimize the risk of loans transitioning into delinquency.

  • Customer Engagement for Fully Paid Loans:

    For loans labeled as "Fully Paid," leverage this positive outcome to engage with customers. Consider soliciting feedback, offering loyalty incentives for repeat business, and encouraging satisfied customers to refer others.

  • Mitigate Risk for Delinquent Loans:

    Given the lower frequencies in other loan statuses, pay special attention to loans with delinquent statuses. Implement proactive measures such as targeted communication, flexible repayment plans, or refinancing options to mitigate the risk of default.

  • Enhance Communication Strategies:

    Improve communication strategies for borrowers across different loan statuses. Provide clear and transparent communication regarding payment schedules, outstanding balances, and any available support or resources in case of financial difficulties.

  • Customized Marketing for Different Loan Statuses:

    Tailor marketing efforts based on the loan status. For instance, develop campaigns to encourage borrowers with fully paid loans to consider additional financial products or services. For borrowers with delinquent status, provide resources to help them get back on track.

  • Implement Predictive Analytics:

    Utilize predictive analytics to forecast the likelihood of loans transitioning into different statuses. This can aid in early intervention strategies for loans showing signs of potential issues.

  • Customer Retention Programs:

    Design customer retention programs for borrowers with fully paid loans to foster long-term relationships. Offer exclusive benefits, promotions, or loyalty programs to incentivize them to choose the company for future financial needs.

  • Diversify Loan Products:

    Explore diversification of loan products to attract a wider range of customers. Different loan products may appeal to various segments of the market, potentially reducing the concentration of loans in the "Current" status.